W9 Lab Assignment


In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import scipy.stats as ss

sns.set_style('white')

%matplotlib inline

High dimensional data

In the IMDb dataset, we have two dimensions (number of votes and rating). What if we have high-dimensional data? First, in many cases the number of dimensions is not too large. For instance, the "Iris" dataset contains four measurements taken on three species of iris flowers. That's more than two dimensions, yet still manageable.

This dataset is also included in seaborn, so we can load it.


In [2]:
iris = sns.load_dataset('iris')
iris.head()


Out[2]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

We get four dimensions (sepal_length, sepal_width, petal_length, petal_width). One direct way to visualize them is to have a scatter plot for each pair of dimensions. We can use the pairplot() function in seaborn to do this.

Try the following code. What do you see?


In [3]:
sns.pairplot(iris)


Out[3]:
<seaborn.axisgrid.PairGrid at 0x17c482ad0b8>

We can also color the symbols based on species:


In [4]:
sns.pairplot(iris, hue='species')


Out[4]:
<seaborn.axisgrid.PairGrid at 0x17c4d0bf860>

The colors represent the three iris species, so for each pair of dimensions we can tell from the scatter plot whether that pair separates the species clearly or not. Which pair of dimensions do you think best separates the species?

#TODO: provide your explanation. The pairs of dimensions that best separate the species are, roughly in order: 1. petal_length vs. petal_width, 2. petal_width vs. sepal_width, 3. petal_length vs. sepal_length.
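
To double-check this answer visually, we can plot just the top pair on its own. The cell below is a quick sketch; sns.scatterplot requires seaborn 0.9 or newer (with older versions, sns.lmplot(..., fit_reg=False) draws a similar plot).

In [ ]:
# Quick visual check of the answer above: petal_length vs. petal_width only.
# sns.scatterplot is available in seaborn >= 0.9; older versions can use
# sns.lmplot(x=..., y=..., hue=..., data=iris, fit_reg=False) instead.
sns.scatterplot(x='petal_length', y='petal_width', hue='species', data=iris)
plt.show()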

PCA

Principal component analysis (PCA) is a widely used dimensionality reduction method. The goal of dimensionality reduction is, of course, to reduce the number of variables (dimensions, measurements, columns).

For example, in the Iris dataset we have four variables (sepal_length, sepal_width, petal_length, petal_width). If we can reduce the number of variables to two, then we can easily visualize them. PCA offers one way to do this.

PCA is already implemented in scikit-learn, a machine learning library for Python that should already be included in your Anaconda installation. If not, install scikit-learn with:

conda install scikit-learn

or

pip install scikit-learn

Before running PCA, we need to transform iris from a DataFrame into a NumPy array. DataFrame.values returns the NumPy representation of a DataFrame.

Extract the four measurement variables as X and the species as Y:


In [5]:
X = iris.values[:, 0:4] # extract the first four columns (the measurements) of all rows
Y = iris.values[:, 4] # extract the fifth column (the species) of all rows
# print(X)
# print(Y)

We can now perform PCA with the following code:


In [6]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2) # set the number of components to 2
X_r = pca.fit(X).transform(X)

#Make a dataframe with the results
df = pd.DataFrame(X_r, columns=['PC1', 'PC2'])
df['species'] = Y
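
To see how much of the original variation the two components retain, we can inspect the fitted model's explained_variance_ratio_ attribute (a standard scikit-learn PCA attribute):

In [ ]:
# Fraction of the total variance captured by each of the two components.
# For the Iris measurements, most of the variance sits in the first component.
print(pca.explained_variance_ratio_)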

Now we only have two dimensions. We can plot them again with the previous code:


In [7]:
sns.pairplot(df, hue='species')


Out[7]:
<seaborn.axisgrid.PairGrid at 0x17c4f9bc3c8>

Compare with the previous plot. What do you think PCA was doing? How did it reduce dimensionality to 2?

#TODO: provide your thoughts. PCA reduces the dimensionality while preserving as much of the variation in the data as possible. It follows these steps:
1. Center the data (shift each column so that its mean is 0).
2. Calculate the covariance matrix.
3. Calculate the eigenvalues and eigenvectors of the covariance matrix.
4. Sort the eigenvalues from largest to smallest; the corresponding eigenvectors are the principal components, and projecting the centered data onto the top two gives the two new dimensions.
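
The steps listed above can be sketched directly with NumPy to check them against scikit-learn's result. This is only a rough illustration (scikit-learn actually computes PCA via an SVD, and the sign of each component may be flipped relative to X_r):

In [ ]:
# Rough NumPy sketch of the PCA steps described above.
Xf = X.astype(float)                    # iris.values is an object-dtype array
Xc = Xf - Xf.mean(axis=0)               # 1. center each column
cov = np.cov(Xc, rowvar=False)          # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # 3. eigenvalues / eigenvectors
order = np.argsort(eigvals)[::-1]       # 4. sort eigenvalues, largest first
W = eigvecs[:, order[:2]]               #    keep the top-2 principal axes
X_manual = Xc @ W                       # project the centered data
print(X_manual[:5])
print(X_r[:5])  # should match up to a possible sign flip per column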

t-SNE

t-SNE (t-Distributed Stochastic Neighbor Embedding) is also a tool for visualizing high-dimensional data. The technique has become widespread in the field of machine learning, since it has an almost magical ability to create compelling two-dimensional “maps” from data with hundreds or even thousands of dimensions.

Let's try it out with the iris data.


In [8]:
from sklearn.manifold import TSNE

In [9]:
from sklearn.datasets import load_iris

iris = load_iris()
X_tsne = TSNE(learning_rate=100, perplexity=30).fit_transform(iris.data)

In [10]:
plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=iris.target, cmap='Set1', s=30)


Out[10]:
<matplotlib.collections.PathCollection at 0x17c51067ba8>

The hyperparameter perplexity determines how to balance attention between local and global aspects of your data. Changing this parameter (default is 30) can cause drastic changes in the output:


In [11]:
X_tsne = TSNE(learning_rate=100, perplexity=10).fit_transform(iris.data)
plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=iris.target, cmap='Set1', s=30)


Out[11]:
<matplotlib.collections.PathCollection at 0x17c5109f630>

Experiment with a few different perplexity values. How do you think it influences the result?


In [12]:
#TODO: put your experiments and answers here.
for i in range(5, 150, 10):  # perplexity must be positive and less than the number of samples (150)
    X_tsne = TSNE(learning_rate=100, perplexity=i).fit_transform(iris.data)
    plt.figure(figsize=(10, 5))
    plt.subplot(121)
    plt.title("Perplexity value: " + str(i))
    plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=iris.target, cmap='Set1', s=30)


As perplexity increases, the scale of the 2D embedding gets smaller and the data seems to spread out more. The perplexity argument is related to the number of nearest neighbors used in other manifold learning algorithms.
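
One way to make the comparison easier is to draw a few perplexity values side by side in a single figure. The values below are just illustrative; any perplexity has to be positive and smaller than the number of samples (150 here).

In [ ]:
# A compact version of the experiment: four perplexity values in one figure.
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, perp in zip(axes, [5, 30, 60, 120]):
    emb = TSNE(learning_rate=100, perplexity=perp).fit_transform(iris.data)
    ax.scatter(emb[:, 0], emb[:, 1], c=iris.target, cmap='Set1', s=30)
    ax.set_title('perplexity = {}'.format(perp))
plt.show()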